Introduction

This file goes through and performs several classification models on the data. It’s primary use was to learn how to use different algorithms while also gathering a fuller picture of the data. As a bonus, these models were put to good use when deciding the correct data imputation method to use for missing data at a time when the collection function was more… primitive.

Findings change over time. I tried to make write-up as flexible as possible with inline code chunks, but–as more data gets added–the results from these tests may make the writing inconsistent with the results. To see the results from when these tests were first performed, change the input to read_csv() so that it reads in the csv file found in the “Analysis” folder.

This was written under R version 4.0.3.

The Dataset

The La Junta dataset is quickly becoming one of my favorites. I suppose there is something to be said about collecting, cleaning, and analyzing data all by yourself.

if(!require(pacman)) install.packages("pacman")

# General Packages
pacman::p_load(dplyr, readr, ggplot2,
               patchwork, lubridate,
               knitr, broom, stringr,
               plotly)

# Imputation Packages
pacman::p_load(VIM)

# KNN Packages
pacman::p_load(rsample, class)

# Logistic Regression Packages
pacman::p_load(pscl, caret, car,
               InformationValue)

# LDA Packages
pacman::p_load(MASS)

# SVM Packages
pacman::p_load(e1071)

# The raw data for La Junta, which is on GitHub
# (as opposed to the local file):
lajunta <- read_csv(paste0("https://raw.githubusercontent.com/", 
                           "Ckrenzer/Winter-Livestock-Data/main/",
                           "La%20Junta%20Market%20Reports.csv"),
                    col_types = cols(Date = "D",
                                     Buyer = col_character(),
                                     Quantity = col_double(),
                                     Type = col_character(),
                                     Weight = col_double(),
                                     Price = col_double(),
                                     URL =  col_character(),
                                     Reprod = col_character()))

# Know that I only used paste0() on the url
# to avoid getting into that dark LaTex magic
# for pdf output!


# Removing the URL column
# Remove missing values for Weight and Price--missing values
# for Reprod will be imputed.
# Missing values for neither Date nor Quantity will be imputed or filtered out.
lajunta <- lajunta %>% 
  dplyr::select(-URL) %>% 
  filter(!is.na(lajunta$Weight),
         !is.na(lajunta$Price))

To familiarize you with the dataset, let’s take a quick peek:

Date Quantity Type Weight Price Reprod
2016-01-05 10 ang 410 220 str
2016-01-05 24 ang 490 213 str
2016-01-05 5 ang 400 197 hfr
2016-01-05 3 lim 475 214 str
2016-01-05 4 lim 525 205 str
2016-01-05 3 lim 430 185 hfr
2016-01-05 11 clr 485 210 str
2016-01-05 11 clr 450 180 hfr

You can see that we have data for the date of the sale, the quantity sold, the type of livestock, the weight in pounds, the price in USD, and the livestock’s reproductive status. There is also data on Buyer names, but I thought it would be rude to include that information in this table. It hasn’t proved very useful anyway.

From my previous work (see “Overview.Rmd,” for instance), I can confidently say that the “Type” column is utterly useless. That’s a story in and of itself, I suppose. The simplest graph that tells the bulk of what we need to know is this one:

The dark theme on ggplot looks really nice, don’t you think? We can clearly see four distinct groups and will explore these relationships in various ways.

Imputation

Some of the data contains missing values. I will not impute missing data for Date, but the Reprod column can be reasonably estimated. The next section will walk through KNN in-depth, but just know that it will be used to impute missing values. The VIM package is used to perform this algorithm.

lajunta <- VIM::kNN(data = lajunta,
                    variable = "Reprod",
                    dist_var = c("Price", "Weight", "Quantity"))

KNN Clustering

Since the data clusters so nicely, I could not resist checking KNN’s performance!

Step 1: Packages

I will be using the rsample package to split the data into testing and training sets. The class package will be used for the KNN algorithm. With this algorithm, I will be classifying cattle reproductive status (cow, bull, steer, or heifer) using weight and market price.

Step 2: Prepare Data

Lucky for me, I already spent a full day cleaning this data! All I have to do is select relevant columns, make the character column a factor, and standardize the numeric columns.

There are only two predictors in this model, weight and price.

# Fortunately, KNN is well-suited for just a few predictors
lajunta_knn <- lajunta %>% 
  dplyr::select(Weight, Price, Reprod)

# storing the string vector as a factor
lajunta_knn$Reprod <- as.factor(lajunta_knn$Reprod)

Since KNN is sensitive to different scales, the values of the predictors need to be standardized.

# The normalizing function, which transforms
# all values into the range from 0 to 1
normalize <- function(x) {
  return((x - min(x)) / (max(x) - min(x))) 
}


# Applying normalize() on the predictor
# columns
lajunta_knn <- lajunta_knn %>% 
  mutate(across(where(is.numeric), normalize))

Step 3: Separate into Training and Test Sets

A stratified, simple random sampling technique will be used to separate values into training and testing data. Stratification is necessary in this instance because there are far fewer cows and bulls in the dataset than there are heifers or steers. Eighty percent of the values will be placed in the training set and the remaining twenty percent will go to the test set.

# Setting the random number stream
set.seed(2021)
lajunta_knn_split <- rsample::initial_split(lajunta_knn,
                                            prob = 0.80,
                                            strata = Reprod)

# As of 2/6/2021, the split is 869 in train
# and 289 in test, with a total of 1158 observations

# separating the data into two new data frames
lajunta_knn_train <- rsample::training(lajunta_knn_split)
lajunta_knn_test <- rsample::testing(lajunta_knn_split)

Step 4: Build Model

To determine the appropriate number for K, I took the square root of the number of observations in the training set.

# Determining the appropriate value of K
optimal_k <- floor(sqrt(nrow(lajunta_knn_train)))

cat("Optimal value of K:", optimal_k)
## Optimal value of K: 117
# To explain the variable names, know that the
# value of optimal_k is 29 (and some change)
# at the time of writing

# the cl argument is the factor of the true
# classifications from the training set
knn_29 <- class::knn(train = lajunta_knn_train[1:2],
                     test = lajunta_knn_test[1:2],
                     cl = lajunta_knn_train$Reprod,
                     k = optimal_k)

Step 5: Evaluate

To calculate the accuracy, the data was thrown into a confusion matrix and the number of correct classifications were calculated as a rate. Though not shown, all cows and bulls are predicted perfectly. The only data which are not classified correctly are steers and heifers.

# Create a confusion matrix--The result from knn()
# and the test data dependent variable are the
# arguments
knn_29_mat <- table(knn_29, lajunta_knn_test$Reprod)


#This function measures accuracy. It takes the confusion matrix as input
accuracy <- function(x){
  return(sum(diag(x)/(sum(rowSums(x)))) * 100)
}


cat("Accuracy", accuracy(knn_29_mat)) # 87.2% as of 2/6/2021
## Accuracy 78.05944

Step 6: Optimize

We can always do better, no? Let’s run this again in a loop with changing values of K. The K with the greatest accuracy should be used for all future predictions. Also, this value should be somewhat close to 117–if we get a number very far from 117, it is likely that we are overfitting the model. That pesky bias-variance trade-off…

# Since I am using a loop, best practice
# is to pre-allocate space in the vector
acc <- numeric(100)
for(i in 1:100){
  knn_mod <- class::knn(train = lajunta_knn_train[1:2],
                        test = lajunta_knn_test[1:2],
                        cl = lajunta_knn_train$Reprod,
                        k = i)
  acc[i] <- accuracy(table(knn_mod, lajunta_knn_test$Reprod))
  
}

The ideal value was 49–very close to 117! Using just two variables, we can predict the reproductive status of livestock with 78.67 percent accuracy! As previously stated, most of these discrepancies are between steers and heifers. If we are only interested in cows and bulls, we will virtually always have the right prediction!

Adding in Quantity

I am going to repeat this process with Quantity added to the model. Will the accuracy improve?

The accuracy after adding Quantity into the model at the optimal value of K is 78.39 percent–slightly worse than the model with only two variables. The more you know the better!

Logistic Regression

There are a few things I want to try with a logistic regression. Though logistic regression can handle more than two categories, it is best-suited for binary dependent variables. Therefore, I will be evaluating three models with binary outcomes:

  1. Determining whether livestock is a steer or a heifer.
  2. Determining whether livestock is a bull or a cow.
  3. Determining whether livestock is either a bull/cow or steer/heifer.

Due to enormous differences between cows/bulls and heifers/steers, I expect zero error in the third regression.

Setup and Cleaning

First things first. In the first regression, only data for steers and heifers will be included. In the second regression, only data for bulls and cows will be included. In the third regression, steers and heifers will be recoded as “y,” for “young,” while cows and bulls will be recoded as “par,” for “parents” (and yes, I understand that bulls are not necessarily fathers; I had trouble coming up with a good name for these categories). With this in mind, let’s store data for the models:

# Data for regression 1
lajunta_logit1 <- lajunta %>% 
  filter(Reprod %in% c("str", "hfr")) %>% 
  mutate(across(where(is.character), as.factor))


# Data for regression 2
lajunta_logit2 <- lajunta %>% 
  filter(Reprod %in% c("bull", "cow")) %>% 
  mutate(across(where(is.character), as.factor))

# Data for regression 3
lajunta_logit3 <- lajunta %>% 
  mutate(Reprod = lajunta[["Reprod"]] %>% 
           str_replace_all("bull|cow", "par") %>% 
           str_replace_all("str|hfr", "y")) %>% 
  mutate(across(where(is.character), as.factor))

Now that we have everything in a tibble, we are ready to split up the data.

Split into Training and Test Sets

The data will be split into testing and training sets with an 80/20 split. For the same reasons used in the KNN classification section, stratified sampling will be used.

# REGRESSION 1 (steer vs heifer)
# Setting the random number stream
set.seed(2021)
# Stratification for the first regression
lajunta_logit1_split <- rsample::initial_split(lajunta_logit1,
                                               prob = 0.80,
                                               strata = Reprod)
# separating the data into two new data frames
lajunta_logit1_train <- rsample::training(lajunta_logit1_split)
lajunta_logit1_test <- rsample::testing(lajunta_logit1_split)



# REGRESSION 2 (bull vs cow)
# Setting the random number stream again
# (in the event R tries to pull some
# funny business)
set.seed(2021)
# Stratification for the first regression
lajunta_logit2_split <- rsample::initial_split(lajunta_logit2,
                                               prob = 0.80,
                                               strata = Reprod)
# separating the data into two new data frames
lajunta_logit2_train <- rsample::training(lajunta_logit2_split)
lajunta_logit2_test <- rsample::testing(lajunta_logit2_split)





# REGRESSION 3 (bull/cow vs. steer/heifer)
# Setting the random number stream again
# (in the event R tries to pull some
# funny business)
set.seed(2021)
# Stratification for the first regression
lajunta_logit3_split <- rsample::initial_split(lajunta_logit3,
                                               prob = 0.80,
                                               strata = Reprod)
# separating the data into two new data frames
lajunta_logit3_train <- rsample::training(lajunta_logit3_split)
lajunta_logit3_test <- rsample::testing(lajunta_logit3_split)

Models

The data is now ready for model assembly. I will be using the stats package’s glm() function as the engine for the regressions.

# REGRESSION 1 (steer vs heifer)
# Quantity is not very different between steers and heifers
logit1 <- glm(data = lajunta_logit1_train,
              formula = Reprod ~ Weight + Price,
              family = "binomial")

# REGRESSION 2 (bull vs cow)
logit2 <- glm(data = lajunta_logit2_train,
              formula = Reprod ~ Weight + Price + Quantity,
              family = "binomial")

# REGRESSSION 3 (bull/cow vs. steer/heifer)
logit3 <- glm(data = lajunta_logit3_train,
              formula = Reprod ~ Weight + Price + Quantity,
              family = "binomial")

Here are the results from the first regression (steers vs heifers):

## 
## Call:
## glm(formula = Reprod ~ Weight + Price, family = "binomial", data = lajunta_logit1_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -8.4904  -0.9112   0.3727   0.8601   2.5112  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.717e+01  3.781e-01  -45.40   <2e-16 ***
## Weight       9.442e-03  2.506e-04   37.68   <2e-16 ***
## Price        7.704e-02  1.684e-03   45.74   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 14900  on 10797  degrees of freedom
## Residual deviance: 11513  on 10795  degrees of freedom
## AIC: 11519
## 
## Number of Fisher Scoring iterations: 6

From the second regression (bull vs cow):

## 
## Call:
## glm(formula = Reprod ~ Weight + Price + Quantity, family = "binomial", 
##     data = lajunta_logit2_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.3025  -0.0130   0.0084   0.0535   2.6986  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) 42.69105    2.49828  17.088   <2e-16 ***
## Weight      -0.01227    0.00080 -15.341   <2e-16 ***
## Price       -0.29728    0.01947 -15.271   <2e-16 ***
## Quantity     0.13695    0.07282   1.881     0.06 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 3728.1  on 2921  degrees of freedom
## Residual deviance:  428.0  on 2918  degrees of freedom
## AIC: 436
## 
## Number of Fisher Scoring iterations: 9

And the third regression (bull/cow vs. steer/heifer):

## 
## Call:
## glm(formula = Reprod ~ Weight + Price + Quantity, family = "binomial", 
##     data = lajunta_logit3_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -5.3554   0.0005   0.0268   0.0764   3.6607  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  8.4502350  1.1234154   7.522  5.4e-14 ***
## Weight      -0.0111256  0.0007451 -14.933  < 2e-16 ***
## Price        0.0089049  0.0046791   1.903    0.057 .  
## Quantity     0.1878244  0.0145354  12.922  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 14210.8  on 13720  degrees of freedom
## Residual deviance:  1218.8  on 13717  degrees of freedom
## AIC: 1226.8
## 
## Number of Fisher Scoring iterations: 9

The predictors in the first regression did a pretty good job and, judging by the p-values, the predictors in the second and third regressions did an abysmal job!

The correct interpretation of the Price variable in the first regression, however, is that a one dollar increase in the price of the livestock is associated with an average increase of 0.0770409 in the log odds of being a steer.

Well, this is quite strange. Let’s continue on to the evaluation and see if we can figure out what’s happening.

Evaluation

McFadden’s R-square

We can use the pR2() function from the pscl package to compute McFadden’s R-square value, which ranges from 0 to just under 1. Values close to 0 indicate the model as no predictive power. Values over 0.4 suggest the model is a good fit for the data.

Let’s begin with the first regression:

## fitting null model for pseudo-r2
## 0.2273094

As expected, the predictors in the first regression do a pretty good job. Let’s see the McFadden statistic from the second regression:

## fitting null model for pseudo-r2
## 0.8851954

Ah! I see now. The model acts strangely because it is deterministic! I expect this to be the case for the third regression as well:

## fitting null model for pseudo-r2
## 0.9142342

Yup.

Variable Importance

We can compute the importance of each predictor variable in the model by using the varImp() function from the caret package. Higher values indicate more importance. These values can help confirm the validity of p-values.

The first regression:

Overall
Weight 37.67728
Price 45.73869

The values are both pretty big, so both of the predictors seem to be ‘carrying their weight.’ This is consistent with our analysis thus far.

What about the second regression?

Overall
Weight 15.340758
Price 15.271488
Quantity 1.880659

That is certainly consistent with the p-values. The third regression?

Overall
Weight 14.932639
Price 1.903113
Quantity 12.921871

Quite low, quite low.

Variance Inflation Factor

We can calculate the VIF of each variable to see if multicollinearity is a problem. VIF scores greater than 5 indicate multicollinearity is an issue that needs to be addressed. Since there are so few variables, multicollinearity is unlikely to have much effect on the results.

The first regression:

##   Weight    Price 
## 2.105645 2.105645

What a relief! Multicollinearity, as measured by VIF, is not a big problem in the data.

The second regression:

##   Weight    Price Quantity 
## 1.542494 1.545808 1.014286

Eh, this could be a problem, but they are still pretty close to five. I opt to willfully ignore the problem. Not shown here, but I already tried transforming the variables to no avail.

The third regression:

##   Weight    Price Quantity 
## 4.210394 4.156566 1.067613

Good. These scores are not very worrying to me.

Test Data

Finally, we can evaluate model performance by using the test dataset. By default, any observation in the test dataset with a probability greater than 0.5 will be predicted to be a steer, cow, or heifer/steer (a “y”), respectively (for regressions 1, 2, and 3). We can calculate these probabilities with this code:

# REGRESSION 1 (steer vs heifer)
# calculating the probability of being a steer
# and storing the results in a vector
logit1_test_probs <- predict(logit1,
                             lajunta_logit1_test,
                             type = "response")

# REGRESSION 2 (bull vs cow)
# calculating the probability of being a cow
# and storing the results in a vector
logit2_test_probs <- predict(logit2,
                             lajunta_logit2_test,
                             type = "response")

# REGRESSSION 3 (bull/cow vs. steer/heifer)
# calculating the probability of being a
# heifer/steer (a "y") and storing the
# results in a vector
logit3_test_probs <- predict(logit3,
                             lajunta_logit3_test,
                             type = "response")

Thankfully, mathematical wizards around the world have come together to develop a magical way of determining the optimal probability that maximizes accuracy of the model. The optimalCutoff() function from the InformationValue package finds this value.

Since regressions 2 and 3 are deterministic, this process will only be done on the first regression. One problem with this approach, I suppose you could argue, is that it risks overfitting the model. You can decide for yourself what you think.

# converting "str" to 1's and "hfr" to 0's
# This is needed for the optimalCutoff() call
# below
lajunta_logit1_test$Reprod <- ifelse(lajunta_logit1_test$Reprod == "str", 1, 0)

# finding optimal cutoff probability to use to maximize accuracy
logit1_optimal <- optimalCutoff(lajunta_logit1_test$Reprod,
                                logit1_test_probs)[1]

# print the optimal probability
cat(logit1_optimal)
## 0.4778741

The optimal probability is 0.4778741. With this new threshold (the old threshold was 0.5, remember), any livestock with a probability of being a steer at or above 0.4778741 will be classified as a steer, with the rest being classified as heifers.

Confusion Matrix

Using the optimal threshold, we can create a confusion matrix showing how the model performed on the test data.

As always, let’s begin with the first regression:

##    hfr  str
## 0 1184  449
## 1  472 1496

The zeroes are heifers and the ones are steers. As we can see, 472 heifers were incorrectly identified as steers and 449 steers were incorrectly identified as heifers. Cool!

Let’s repeat the process for the second regression:

##   bull cow
## 0  317  20
## 1   10 628

The model is deterministic. Zero misclassifications!

I expect the same for the third regression:

##   par    y
## 0 951    2
## 1  24 3598

Deterministic indeed.

ROC Curves

We can also calculate the sensitivity (the “true positive rate”), specificity (the “true negative rate”), and the total number of misclassifications using the numbers from the confusion matrix. I’ll spare you the details on the sensitivity and specificity and cut straight to the error rate. Further, since we already know regressions 2 and 3 are deterministic, I won’t waste anyone’s precious time showing that the error rate is 0 percent in these two models.

The misclassification error rate for the first regression at the optimal threshold:

## 0.2558

The data were classified incorrectly on the test data 25.58 percent of the time.

Finally, we can put together ROC curves to visualize how well the models fit. The greater the AUC (area under the curve), the more accurately the model predicts.

For the first regression

The first regression seems to be a reasonably good fit, as measured by the ROC curve.

The second curve? That’s just a gigantic blue rectangle. The same is true of the ROC curve for the third regression. My intuition at the outset of this section was correct. There was no error in the third regression.

Linear Discriminant Analysis (LDA)

While we are in the neighborhood of classification models, why don’t we give LDA a go? As we all know by this point, there are four categories to be interested in classifying: bull, cow, steer, and heifer. Also, as you may have guessed, the predictors are price, weight, and quantity. Date is a strange variable, and Type has too little predictive power to bother including.

Packages

The engine for the LDA model is the one provided in the MASS package. As usual, the sampling will be done with the rsample package–in the same manner as the previous classifications.

Scale Data

Since one of the key assumptions of LDA is that each predictor has the same variance, they should all be scaled to give the data a mean of zero and a standard deviation of one. We can do this with a mutate(across()) call:

lajunta_lda <- lajunta %>% 
  mutate(across(where(is.numeric), scale),
         Date = Date)

And we can confirm that the standard deviation is, indeed, 1 after transformation:

lajunta_lda %>% 
  summarize(across(where(is.numeric), sd)) %>% 
  unlist()
## Quantity   Weight    Price 
##        1        1        1

Excellent. We are now ready to separate into training and test samples.

Samples

Due to imbalances in the number of observations in reproductive status of the livestock, stratified sampling will (again) be used. Okay, let’s make the 80/20 split:

# Setting the random number stream
set.seed(2021)
lajunta_lda_split <- rsample::initial_split(lajunta_lda,
                                            prob = 0.80,
                                            strata = Reprod)

# separating the data into two new data frames
lajunta_lda_train <- rsample::training(lajunta_lda_split)
lajunta_lda_test <- rsample::testing(lajunta_lda_split)

Build Model

Now, let’s fit the model and see the results:

lajunta_lda_fit <- MASS::lda(Reprod ~ Price + Weight + Quantity,
                             data = lajunta_lda_train)

lajunta_lda_fit
## Call:
## lda(Reprod ~ Price + Weight + Quantity, data = lajunta_lda_train)
## 
## Prior probabilities of groups:
##       bull        cow        hfr        str 
## 0.07281341 0.14103499 0.36115160 0.42500000 
## 
## Group means:
##           Price     Weight   Quantity
## bull -1.0343254  2.4402051 -0.7845741
## cow  -1.6963634  1.2925115 -0.7014357
## hfr   0.1919994 -0.4826489  0.1416393
## str   0.5668369 -0.4292520  0.2349026
## 
## Coefficients of linear discriminants:
##                 LD1       LD2        LD3
## Price     0.2117567 2.0671210 -0.2643636
## Weight   -2.2119744 1.6628994  0.2191836
## Quantity  0.3544986 0.1238638  1.0418004
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.8841 0.1159 0.0000

Interpretation

Let’s break this output down bit by bit…

Prior Probabilities of Groups

These represent the proportions of each reproductive status of the livestock in the training set. Bulls, for example, make up 7.3 percent of the observations in the training dataset.

Group Means

The group means are the mean values for each predictor in the model, broken down by the possible outcomes for the dependent variable.

Coefficients of Linear Discriminants

These coefficients represent the linear combination of predictor variables that are used to form the decision rule of the model. For example,

LD1 = (0.2117567)(Price) - (2.2119744)(Weight) + (0.3544986)(Quantity)

Proportion of Trace

Displays the percentage separation achieved by each linear discriminant function (Ex. if LD1 was 0.73, we would say that “LD1 is responsible for 73% of the separation between the groups”).

Evaluation

Following our prepare-split-train process, we have now come to the step where we analyze the test data.

lajunta_lda_probs <- predict(lajunta_lda_fit,
                             lajunta_lda_test)

There are three elements of a list from this prediction–the predicted class, posterior probability than an observation belongs to each class (the probability of the observation being a steer, heifer, bull, or cow–with these four probabilities summing up to 1), and the linear discriminants.

Let’s see the first few posterior probabilities in the test set:

bull cow hfr str
0 0e+00 0.0860803 0.9139197
0 0e+00 0.0958497 0.9041503
0 0e+00 0.1042382 0.8957618
0 0e+00 0.1078766 0.8921234
0 0e+00 0.1402275 0.8597725
0 0e+00 0.1142731 0.8857269
0 0e+00 0.1472000 0.8528000
0 3e-07 0.3522007 0.6477990

I imagine that I am losing you by this point, because the outcome for these classification problems all approximate the same thing (that the graph in the introduction shows plain as day). I think it is good to use several approaches to confirm results, however, because that only strengthens our understanding of the interaction between the variables.

With that being said, observation of the above table once again confirms that the difference between heifers and steers is far more ambiguous than differences between any other cattle reproductive statuses.

The question I think you are likely most interested in, “Does LDA perform better than KNN or any of the logistic regressions,” can be answered easily by comparing the model’s classifications of the test data to the actual observations in the test set.

# For you non-R programmers out there,
# R mathematical functions translate
# TRUE's into 1's and FALSE's into 0's.
# In other words, sum(TRUE, TRUE, FALSE) returns 2.
mean(lajunta_lda_probs$class == lajunta_lda_test$Reprod)

Drum roll please! The model correctly identified the true reproductive status 77 percent of the time! Well, at least now we know that KNN is a better predictor if we want to classify reproductive status in the future! Regardless, they are both reasonably close to one another, so KNN is probably a better choice.

Often times, the people we need to convey insights to have no background in statistics; Occam’s Razor means something approaching, “the simpler explanation is usually the most correct.” Chances are that we will have a much easier time explaining KNN than we would describing linear discriminant analysis. If a simpler method does a “good-enough” job, choose the simpler method.

Visualization of Separation

Finally, we can visualize how well LDA separated the different reproductive statuses of cattle:

We could make similar plots using LD3 as well, but LD1 and LD2 are by far the most important linear discriminants. There is no real justification for including LD3.

For those of you looking to improve this model, try transforming some of the predictors!

Support Vector Machines (SVM)

I am in the mood to keep going with these classification models. I want to get a model that predicts with 95 percent accuracy, but we haven’t met that threshold with any of the preceding models. This model is similar to the logistic regression model, but instead of filtering data down to just steers and heifers, it includes cows and bulls as well. The predictors include Weight, Price, and Quantity.

The question this model answers: “is the livestock a steer?”

Packages

The e1071 package will be the engine used to implement the algorithm. The plotly package was used to create the diagram in the next section.

Data Overlap

The algorithm is supposed to classify observations by creating a hyperplane with the maximum possible margins (in a 2D setting, you can think of a plane as a line and the margins as the “buffer zone” between the points and the line–we want the largest “buffer zone” possible). Even in the 3D plane, values between steers and heifers overlap often. To remedy this, a kernel function will be used to transform the data into a higher dimension. You can examine the existing overlap yourself using the interactive diagram below (HTML file only).

Cleaning

Character variables will be transformed into factors, the Date column will become a date type, and the logical column, “is_Steer,” will be added to the data frame. Note that missing values for all variables are removed.

#note: remember that missing values for Price and Weight
#      were filtered out, and missing values for Reprod
#      were imputed using K-nearest neighbors.
lajunta_svm <- lajunta %>% 
  mutate(across(where(is.character), as.factor),
         Date = Date,
         "is_Steer" = ifelse(Reprod == "str", TRUE, FALSE)) %>% 
  filter(!is.na(Date),
         !is.na(Buyer),
         !is.na(Quantity),
         !is.na(Type),
         !is.na(is_Steer))

Sampling

Alright, let’s make the split:

# Setting the random number stream
set.seed(2021)
# transforming character vectors into factors
# and adding the boolean is_Steer column
lajunta_svm_split <- initial_split(lajunta_svm,
                                   prob = 0.80,
                                   strata = Reprod)

# separating the data into two new data frames
lajunta_svm_train <- training(lajunta_svm_split)
lajunta_svm_test <- testing(lajunta_svm_split)

Computing Support Vectors

Multiple different kernel functions will be tried to find the best fit. Radial basis functions are usually a go-to for these types of operations, and so this is the one I am most interested in.

# linear
lajunta_svm_classifier <- svm(data = lajunta_svm_train, is_Steer ~ Price + Weight + Quantity, 
                              type = "C-classification", kernel = "linear")

# quadratic
lajunta_svm_classifier2 <- svm(data = lajunta_svm_train, is_Steer ~ Price + Weight + Quantity, 
                               type = "C-classification", kernel = "polynomial", degree = 2)

# cubic
lajunta_svm_classifier3 <- svm(data = lajunta_svm_train, is_Steer ~ Price + Weight + Quantity, 
                               type = "C-classification", kernel = "polynomial", degree = 3)

# sigmoid
lajunta_svm_classifier4 <- svm(data = lajunta_svm_train, is_Steer ~ Price + Weight + Quantity, 
                               type = "C-classification", kernel = "sigmoid")

# radial
lajunta_svm_classifier5 <- svm(data = lajunta_svm_train, is_Steer ~ Price + Weight + Quantity, 
                               type = "C-classification", kernel = "radial")

On to the fun part!

Evaluation

As usual, we now check the model’s performance using the test dataset.

Test Set

# linear
lajunta_svm_predict <- predict(lajunta_svm_classifier, newdata = lajunta_svm_test %>% 
                                 dplyr::select(Weight, Price, Quantity, is_Steer))

# quadratic
lajunta_svm_predict2 <- predict(lajunta_svm_classifier2, newdata = lajunta_svm_test %>% 
                                  dplyr::select(Weight, Price, Quantity, is_Steer))

# cubic
lajunta_svm_predict3 <- predict(lajunta_svm_classifier3, newdata = lajunta_svm_test %>% 
                                  dplyr::select(Weight, Price, Quantity, is_Steer))

# sigmoid
lajunta_svm_predict4 <- predict(lajunta_svm_classifier4, newdata = lajunta_svm_test %>% 
                                  dplyr::select(Weight, Price, Quantity, is_Steer))

# radial
lajunta_svm_predict5 <- predict(lajunta_svm_classifier5, newdata = lajunta_svm_test %>% 
                                  dplyr::select(Weight, Price, Quantity, is_Steer))

Confusion Matrix

Now, let’s see how the classifier performed with a confusion matrix:

Linear Kernel Function
FALSE TRUE
0 2089 541
1 657 1289
Quadratic Kernel Function
FALSE TRUE
0 1814 816
1 351 1595
Cubic Kernel Function
FALSE TRUE
0 1646 984
1 315 1631
Sigmoid Kernel Function
FALSE TRUE
0 1740 890
1 891 1055
Radial Kernel Function
FALSE TRUE
0 2190 440
1 470 1476

Using a linear kernel function, the model’s specificity and sensitivity are very similar. The misclassification error rate is similar to that of the quadratic and cubic polynomial kernel functions. What is different, of course, is that the models with quadratic and cubic kernel functions have many false negatives and false positives, respectively. What is interesting to note, however, is that the model with the quadratic kernel function has very few false positives and the the cubic function has very few false negatives (in short: quadratic = few false negatives and many false positives … cubic = few false positives and many false negatives).

In other contexts, we might want to be conservative in the amount of false positives we get–even if that means we do this at the expense of the overall misclassification error rate. If this was banking data and we wanted to predict whether someone was going to default on a loan, for instance, we may permit many false positives (predicted they will default, even though they won’t) in the model–rejecting people access to credit–if it means avoiding a false negative (in which the firm loses loads of money on someone who defaults, despite the model predicting the individual would not default). These considerations are important, but my goal is to minimize the overall misclassification error rate. People are not losing their businesses because a model predicts livestock of a given weight and price to be a steer when it is a heifer in reality!

The sigmoid function was the least accurate and performed almost as well as random guessing! Both sensitivity and specificity are low. This was the weakest model yet.

The radial function is the most accurate, having a misclassification error rate of 80.1 percent. Not quite the 95 percent we are looking for–and this, too, performs similarly to KNN clustering–so I think we have reached a plateau in predicting power using just three variables.

Let’s try adding variations to the model with a radial kernel function.

Optimization

There are three new variants to the model: one with only the Date variable included, one with only the Type variable included, and one with both the Date and Type variables included. You can see confusion matrices below:

Radial Kernel Function (with Date)
FALSE TRUE
0 2281 349
1 361 1585
Radial Kernel Function (with Type)
FALSE TRUE
0 2129 501
1 440 1506
Radial Kernel Function (with Date and Type)
FALSE TRUE
0 2227 403
1 405 1541

It would appear that the Type column increases misclassification error. That’s too bad–with no prior knowledge, I really thought that the type of cattle would influence the price. I spent a lot of time organizing that column. All the work that went into that was not in vain, though, since I would not have been able to disprove the fact that the type of cattle has no bearing on its price without doing the overhead.

The accuracy of the model that added only Date is 84.5 percent, which is a huge improvement over its counterpart with only the weight, price, and quantity. This is the most accurate model, and it’s pretty close to 95 percent!

Thank You!

If you’ve made it this far–you’re a trooper! I hope you enjoyed going through this as much as I did! It was written over the course of four days. If you have any questions or want to provide feedback, do not hesitate to reach out. Take care now!